Abstract: We consider a problem on the synthesis of reactive controllers that optimize some a priori unknown performance criterion while interacting with an uncontrolled environment such that the system satisfies a given temporal logic specification. We decouple the problem into two subproblems. First, we extract a (maximally) permissive strategy for the system, which encodes multiple (possibly all) ways in which the system can react to the adversarial environment and satisfy the specifications. Then, we quantify the a priori unknown performance criterion as a (still unknown) reward function and compute an optimal strategy for the system within the operating envelope allowed by the permissive strategy by using the so-called maximin-Q learning algorithm. We establish both correctness (with respect to the temporal logic specifications) and optimality (with respect to the a priori unknown performance criterion) of this two-step technique for a fragment of temporal logic specifications. For specifications beyond this fragment, correctness can still be preserved, but the learned strategy may be sub-optimal. We present an algorithm for the overall problem, and demonstrate its use and computational requirements on a set of robot motion planning examples.

I. INTRODUCTION 

The goal of this paper is to synthesize optimal reactive strategies for systems with respect to some unknown performance criterion and in an adversarial environment such that given temporal logic specifications are satisfied. The consideration of an unknown performance criterion may seem unreasonable at first sight, but it turns out to be an effective supplement to the specification as a task description and suits the needs of many applications. On the one hand, general requirements on system behaviors such as safety concerns and task rules may be known and expressed as specifications in temporal logic. On the other hand, quantitative performance criteria can help encode more subtle considerations, such as specific intentions for the current application scenario and personal preferences of human operators who work with the autonomous system. For a path planner of autonomous vehicles, specifications impose fixed nonnegotiable constraints like safety requirements, e.g., always drive in the correct lane, never run a red light and eventually reach the destination. Quantitative performance criteria give preferences within the context constrained by the specifications, which may involve considerations that have not been taken into account during controller design and are suggested by the human operators.

The two main topics most relevant to our work are reactive synthesis with temporal logic specifications and reinforcement learning with respect to unknown performance criteria. Neither solves the problem we consider in this paper.

On the synthesis side, early work focused on planning in static known environments [13], [25]. Reactivity to the changes in dynamic environments is a crucial functionality.
For example, the environment of an autonomous vehicle 
involves the other vehicles and pedestrians moving nearby, 
and it is impractical to expect an autonomous vehicle to 
run on roads safely without reacting to its surrounding 
environment in real time. Recently, references [15], [16], [17] 
considered possibly adversarial environments and reactive 
strategies (without any quantitative performance criteria). 

Another concern in synthesis is optimality with respect 
to a given performance criterion. Optimal strategies have 
been studied with respect to given objectives while satisfying 
some temporal logic specifications, mostly in deterministic
environments or stochastic environments with known
transition distribution [4], [27]. Both qualitative objectives 
such as correctness guarantee with respect to an adversarial 
environment and quantitative objectives such as mean payoffs 
were studied in [3] though these results crucially rely on the 
quantitative measure being known a priori. 

In order to deal with problems with an a priori unknown performance criterion, it is intuitive to gain experience from direct interactions with the environment or with a human operator, which coincides with the motivation of many reinforcement learning methods [22]. Multiple learning methods have been studied and are available for problems with unknown reward functions and incomplete prior knowledge of system models [1], [18], [21], [26], and have been used in many applications, including the famous TD-Gammon example [23] and robot collision avoidance [8]. However, the learning process generally cannot guarantee the satisfaction of other independently imposed specifications while maximizing the expected rewards at the same time, though these methods can be modified to deal with some simple cases [7], [14].

To the best of our knowledge, the current paper is the first to deal with the problem of synthesizing a controller which optimizes some a priori unknown performance criterion while interacting with an uncontrolled environment in a way that satisfies the given temporal logic specifications. The approach we take is based on a decomposition of the problem into two subproblems. For the first part (Section IV-A), the intuition is to extract a strategy for the system, namely a permissive strategy [2], which encodes multiple (possibly all) ways in which the system can react to the adversarial environment and satisfy the specifications. Then in the second part (Section IV-B), we quantify the a priori unknown performance criterion as a (still unknown) reward function and apply the idea of reinforcement learning to choose an optimal strategy for the system within the operating envelope allowed by the permissive strategy. By decoupling the optimization problem with respect to the unknown cost from the synthesis problem, we manage to synthesize a strategy for the system that is guaranteed to both satisfy the specifications and reach optimality over a set of winning strategies with respect to the a priori unknown performance criterion (Section IV-C).

II. Preliminaries 

We now introduce some basic concepts used in the rest of the paper.

A. Two-player games 

First we model the setting as a two-player game. In this 
model we care about not only the controlled system, but also 
its external uncontrolled environment. Interactions between 
the controlled system and the uncontrolled environment 
play a critical role in guaranteeing the correctness of given 
specifications, as we will discuss later. 

Definition 1. A two-player game, or simply a game, is defined as a tuple $\mathcal{G} = (S, S_s, S_e, I, A_c, A_{uc}, T, W)$, where $S$ is a finite set of states; $\{S_s, S_e\}$ is a partition of $S$, i.e., $S = S_s \cup S_e$, $S_s \cap S_e = \emptyset$; $I \subseteq S$ is a set of initial states; $A_c$ is a finite set of controlled actions of the system; $A_{uc}$ is a finite set of uncontrolled actions for the environment and $A_{uc} \cap A_c = \emptyset$; $T : S \times \{A_c \cup A_{uc}\} \to 2^S$ is a transition function; $W$ is the winning condition defined later.

$S_s$ and $S_e$ are the sets of states from which it is the system's or the environment's turn to take actions, respectively. There are no available uncontrolled actions (environment actions) at any state $s \in S_s$, and correspondingly, states in $S_e$ cannot respond to any controlled action (system action). Let $A(s)$ be the set of actions available at state $s \in S$. Hence $A(s) \subseteq A_c$ if $s \in S_s$ and $A(s) \subseteq A_{uc}$ if $s \in S_e$.

If the transition function $T$ of $\mathcal{G}$ satisfies $|T(s, a)| \le 1$ for all $s \in S$ and $a \in A(s)$, the game is called deterministic, otherwise the game is called non-deterministic, highlighting the fact that an action may then lead to more than one successor state. We assume here that $\mathcal{G}$ is deterministic.

A run $\pi = s_0 s_1 s_2 \ldots$ of $\mathcal{G}$ is an infinite sequence of states such that $s_0 \in S$ and for $i \in \mathbb{N}$, there exists $a_i \in A(s_i)$ such that $s_{i+1} = T(s_i, a_i)$. Without loss of generality, assume that all states are reachable from $I$ in $\mathcal{G}$.
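
To make Definition 1 concrete, a deterministic game can be stored as a table from state-action pairs to successors. The following is a minimal Python sketch under assumptions of ours: the class name `Game`, the helper `run_prefix`, and the toy states and actions are hypothetical illustrations and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    """Deterministic turn-based game; all names here are illustrative."""
    sys_states: frozenset   # S_s
    env_states: frozenset   # S_e
    init: frozenset         # I
    trans: dict             # (state, action) -> unique successor, so |T(s, a)| <= 1

    def actions(self, s):
        """A(s): actions available at state s, in a fixed (sorted) order."""
        return sorted(a for (q, a) in self.trans if q == s)

    def check(self):
        """Raise ValueError if the conditions of Definition 1 are violated."""
        states = self.sys_states | self.env_states
        if self.sys_states & self.env_states:
            raise ValueError("S_s and S_e must be disjoint")
        if not self.init <= states:
            raise ValueError("I must be a subset of S")
        a_c = {a for (s, a) in self.trans if s in self.sys_states}
        a_uc = {a for (s, a) in self.trans if s in self.env_states}
        if a_c & a_uc:
            raise ValueError("A_c and A_uc must be disjoint")
        for (s, _), t in self.trans.items():
            if s not in states or t not in states:
                raise ValueError("transition leaves the state set")


def run_prefix(game, start, choose, length):
    """Finite prefix s_0 ... s_length of a run; choose(s, actions) picks a_i."""
    run = [start]
    for _ in range(length):
        s = run[-1]
        run.append(game.trans[(s, choose(s, game.actions(s)))])
    return run


# Hypothetical two-state game in which the players strictly alternate.
toy = Game(frozenset({"s0"}), frozenset({"e0"}), frozenset({"s0"}),
           {("s0", "go"): "e0", ("e0", "reply"): "s0"})
toy.check()
print(run_prefix(toy, "s0", lambda s, acts: acts[0], 4))
```

Since runs are infinite, only finite prefixes can be enumerated this way; the sketch also assumes that every reachable state has at least one available action.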


B. Linear temporal logic 

We use fragments of linear temporal logic (LTL) to specify the assumptions on environment behaviors and the requirements for the system. LTL can be regarded as a generalization of propositional logic. In addition to logical connectives such as conjunction ($\wedge$), disjunction ($\vee$), negation ($\neg$) and implication ($\rightarrow$), LTL also includes basic temporal operators such as next ($\bigcirc$) and until ($\mathcal{U}$), derived temporal operators like always ($\square$) and eventually ($\lozenge$), and any (nested) combination of them, like always eventually ($\square\lozenge$).

An atomic proposition is a Boolean variable (or propositional variable). Suppose $AP$ is a finite set of atomic propositions, then we can construct LTL formulas as follows:

(i) Any atomic proposition $p \in AP$ is an LTL formula;

(ii) given formulas $\varphi_1$ and $\varphi_2$, $\neg\varphi_1$, $\varphi_1 \wedge \varphi_2$, $\bigcirc\varphi_1$ and $\varphi_1 \mathcal{U} \varphi_2$ are LTL formulas.

A formula without any temporal operators is called a Boolean formula or assertion. A linear time property is a set of infinite sequences over $2^{AP}$.

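LTL formulas are evaluated over executions: An execution $\sigma = \sigma_0 \sigma_1 \sigma_2 \ldots$ is an infinite sequence of truth assignments to variables in $AP$, where $\sigma_i$ is the set of atomic propositions that are True at position $i \in \mathbb{N}$. Given an execution $\sigma$ and an LTL formula $\varphi$, the conditions that $\varphi$ holds at position $i$ of $\sigma$, denoted by $\sigma, i \models \varphi$, are constructed inductively as follows:

(i) Let $P(\varphi)$ be the set of atomic propositions appearing in $\varphi$. Then for any $p \in P(\varphi)$, $\sigma, i \models p$ iff $p \in \sigma_i$.

(ii) $\sigma, i \models \neg\varphi$ iff $\sigma, i \not\models \varphi$.

(iii) $\sigma, i \models \bigcirc\varphi$ iff $\sigma, i+1 \models \varphi$.

(iv) If $\varphi = \varphi_1 \wedge \varphi_2$, then $\sigma, i \models \varphi$ iff $\sigma, i \models \varphi_1$ and $\sigma, i \models \varphi_2$.

(v) If $\varphi = \varphi_1 \vee \varphi_2$, then $\sigma, i \models \varphi$ iff $\sigma, i \models \varphi_1$ or $\sigma, i \models \varphi_2$.

(vi) If $\varphi = (\varphi_1 \rightarrow \varphi_2)$, then $\sigma, i \models \varphi$ iff $\sigma, i \models \varphi_1$ implies $\sigma, i \models \varphi_2$.

(vii) If $\varphi = \varphi_1 \mathcal{U} \varphi_2$, then $\sigma, i \models \varphi$ iff there exists $k \ge i$ such that $\sigma, j \models \varphi_1$ for all $i \le j < k$ and $\sigma, k \models \varphi_2$.

(viii) $\lozenge\varphi = \mathrm{True}\ \mathcal{U}\ \varphi$, $\square\varphi = \neg\lozenge\neg\varphi$.

We say that $\varphi$ holds on $\sigma$ or $\sigma$ satisfies $\varphi$, denoted by $\sigma \models \varphi$, if $\sigma, 0 \models \varphi$.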
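
The inductive semantics above can be executed directly on ultimately periodic executions, i.e., a finite prefix followed by a loop repeated forever. The Python sketch below is only an illustration under our own assumptions: formulas are encoded as nested tuples, and the names `sat_positions`, `holds`, `eventually` and `always` are hypothetical. Until is computed as a least fixpoint over the finitely many positions of the lasso.

```python
def sat_positions(formula, prefix, loop):
    """Positions of the word prefix . loop^omega at which the formula holds.

    prefix and loop are lists of sets of atomic propositions; loop is non-empty.
    """
    word = list(prefix) + list(loop)
    n = len(word)
    everything = set(range(n))

    def succ(i):
        # The last position of the loop wraps around to the loop start.
        return i + 1 if i + 1 < n else len(prefix)

    op = formula[0]
    if op == "true":
        return everything
    if op == "ap":
        return {i for i in range(n) if formula[1] in word[i]}
    if op == "not":
        return everything - sat_positions(formula[1], prefix, loop)
    if op == "and":
        return (sat_positions(formula[1], prefix, loop)
                & sat_positions(formula[2], prefix, loop))
    if op == "or":
        return (sat_positions(formula[1], prefix, loop)
                | sat_positions(formula[2], prefix, loop))
    if op == "next":
        inner = sat_positions(formula[1], prefix, loop)
        return {i for i in range(n) if succ(i) in inner}
    if op == "until":
        left = sat_positions(formula[1], prefix, loop)
        result = set(sat_positions(formula[2], prefix, loop))
        changed = True
        while changed:  # least fixpoint: add positions that can reach `result`
            new = {i for i in left if succ(i) in result} - result
            changed = bool(new)
            result |= new
        return result
    raise ValueError(f"unknown operator: {op}")


def eventually(f):
    return ("until", ("true",), f)


def always(f):
    return ("not", eventually(("not", f)))


def holds(formula, prefix, loop):
    """sigma |= formula, i.e., the formula holds at position 0."""
    return 0 in sat_positions(formula, prefix, loop)


# Execution {a} ({b} {})^omega: b holds infinitely often, a only at position 0.
prefix, loop = [{"a"}], [{"b"}, set()]
print(holds(always(eventually(("ap", "b"))), prefix, loop))
print(holds(always(("ap", "a")), prefix, loop))
```

The restriction to lasso-shaped executions is a simplification for illustration; the paper's definitions apply to arbitrary infinite executions.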

An LTL formula $\varphi_1$ is a safety formula if for every execution $\sigma$ that violates $\varphi_1$, there exists an $i \in \mathbb{N}^+$ such that for every execution $\sigma'$ that coincides with $\sigma$ up to position $i$, $\sigma'$ also violates $\varphi_1$. An LTL formula $\varphi_2$ is a liveness formula if for every prefix of any execution $\sigma_0, \ldots, \sigma_i$ ($i \ge 0$), there exists an infinite execution $\sigma'$ with prefix $\sigma_0, \ldots, \sigma_i$ such that $\sigma' \models \varphi_2$. Intuitively, safety formulas indicate that “something bad should never happen,” and liveness formulas require that “good things will happen eventually.”

Let $AP$ be a set of atomic propositions, and define a labeling function $L : S \to 2^{AP}$ such that each state $s \in S$ is mapped to the set of atomic propositions that hold True at state $s$. A word is an infinite sequence of labels $L(\pi) = L(s_0)L(s_1)L(s_2)\ldots$ where $\pi = s_0 s_1 s_2 \ldots$ is a run of $\mathcal{G}$. We say a run $\pi$ satisfies $\varphi$ if and only if $L(\pi) \models \varphi$.

To complete the definition of two-player games, define the winning condition $W = (L, \varphi)$ such that $L$ is a labeling function and $\varphi$ is an LTL formula, and a run $\pi$ of $\mathcal{G}$ is winning for the system if and only if it satisfies $\varphi$. The formula $\varphi$ can be used to express the qualitative specifications such as system requirements and environment assumptions.

C. Control strategies 

Given the game $\mathcal{G}$, we would like to synthesize a control protocol such that the runs of $\mathcal{G}$ satisfy the specification $\varphi$.

A (deterministic) memoryless strategy for the system is a map $\mu : S_s \to A_c$, where $\mu(s) \in A(s)$ for all $s \in S_s$. A (deterministic) finite-memory strategy for the system is a tuple $\mu = (\mu_m, \rho_m, M)$ where $\mu_m : S_s \times M \to A_c$ such that $\mu_m(s, m) \in A(s)$ for all $s \in S_s$, $m \in M$, and $\rho_m : S \times M \to M$. The finite set $M$ is called the memory and $\rho_m$ is also called the memory update function. The memory $m$ is initialized to be $m_0 \in M$. Strategies can also be defined as non-deterministic, in which case $\mu$ will be defined as $\mu : S_s \to 2^{A_c}$ for memoryless strategies or $\mu = (\mu_m, \rho_m, M)$ with $\mu_m : S_s \times M \to 2^{A_c}$ for finite-memory strategies. Clearly deterministic strategies can be regarded as a special case of non-deterministic strategies in which every set of allowed actions is a singleton, and memoryless strategies as a special case of finite-memory strategies in which $M$ is a singleton. We require $|\rho_m(s, m)| = 1$ for all $s \in S$ and $m \in M$, whether or not the strategy is deterministic. $\rho_m$ will be evaluated each time after any player takes an action. If we further specify the probability distribution $P$ over $A(s)$ for each state $s \in S_s$, the corresponding strategies are called randomized strategies. We refer to deterministic strategies unless otherwise stated. By replacing $S_s$ by $S_e$ and $A_c$ by $A_{uc}$, we can define memoryless and finite-memory strategies for the environment.

A run $\pi = s_0 s_1 s_2 \ldots$ is induced by a strategy $\mu$ for the system if for any $i \in \mathbb{N}$ such that $s_i \in S_s$, $s_{i+1} = T(s_i, \mu(s_i))$ (for memoryless strategies) or there exists an infinite sequence of memories $m_0 m_1 m_2 \ldots \in M^\omega$ such that $s_{i+1} = T(s_i, \mu_m(s_i, m_i))$ and for all $s_j \in S$, $m_{j+1} = \rho_m(s_{j+1}, m_j)$ (for finite-memory strategies). Let $R^\mu(s)$ be the set of runs of $\mathcal{G}$ induced by a strategy $\mu$ for the system and initialized with $s \in S$. Note that $|R^\mu(s)| > 1$ when the strategies for the environment are not unique, even if $\mu$ is deterministic.

We say a strategy $\mu$ for the system wins at state $s \in S$ if all runs $\pi \in R^\mu(s)$ are winning for the system. A strategy $\mu$ is called a winning strategy if it wins at all initial states of $\mathcal{G}$. A formula $\varphi$ is realizable for $\mathcal{G}$ if there exists a winning strategy for the system with $W = (L, \varphi)$.

D. Reward functions 

Besides qualitative requirements which are encoded in the 
winning condition, we also consider quantitative evaluation 
from other sources such as the human operators. Such
evaluation is modeled as a reward function which we want
to maximize by choosing a proper strategy for the system.

In order to evaluate the system strategy, we first map each system state-action pair to a nonnegative value by an instantaneous reward function $\mathcal{R} : S_s \times (A_c \cup A_{uc}) \to \mathbb{R}^+ \cup \{0\}$, and then consider the “accumulation” of such instantaneous rewards obtained over a run of a game $\mathcal{G}$. As runs are of infinite length, we cannot simply add all the instantaneous rewards acquired, as the sum may approach infinity. Instead we define a reward function $J^{\mathcal{G}}_{\mathcal{R}} : S^\omega \to \mathbb{R}$ to compute the reward for any run $\pi$ of $\mathcal{G}$ given the instantaneous reward function $\mathcal{R}$. A common example of $J^{\mathcal{G}}_{\mathcal{R}}$ is the discounted reward

$$J^{\mathcal{G}}_{\mathcal{R}}(\pi) = \sum_{k=0}^{\infty} \gamma^k R_{k+1}, \tag{1}$$

where $\gamma$ is a discount factor satisfying $0 \le \gamma < 1$, and $R_{k+1}$ is the $(k+1)$th instantaneous reward obtained by the system. In this case, rewards acquired earlier are given more weight, while in other examples like the mean payoff function $\liminf_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} R_i$, weights on instantaneous rewards are independent of their position in the sequence.

Now we define a reward function $\bar{J}^{\mathcal{G}}_{\mathcal{R}} : \mathcal{P} \times S_s \to \mathbb{R}^+ \cup \{0\}$ to evaluate each strategy for the system, where $\mathcal{P}$ is the set of system strategies. Usually $|R^\mu(s)| > 1$ as the uncontrolled environment has more than one strategy, and thus the definition of $\bar{J}^{\mathcal{G}}_{\mathcal{R}}(\mu, s)$ is not unique given $J^{\mathcal{G}}_{\mathcal{R}}(\pi)$ for all runs in $R^\mu(s)$. One commonly used choice for $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ is the expectation of $J^{\mathcal{G}}_{\mathcal{R}}(\pi)$ over all runs in $R^\mu(s)$ with some given distribution for the environment strategy, i.e., $E_{\pi \in R^\mu(s)}[J^{\mathcal{G}}_{\mathcal{R}}(\pi)]$. The distribution is usually estimated from interaction experience with the environment. Another common way is to define $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ as the minimal possible reward acquired when the system strategy is $\mu$, i.e.,

$$\bar{J}^{\mathcal{G}}_{\mathcal{R}}(\mu, s) = \inf_{\pi \in R^\mu(s)} J^{\mathcal{G}}_{\mathcal{R}}(\pi), \tag{2}$$

which we use as the reward function in our problem.
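
As a small numerical illustration of (1) and (2), the sketch below evaluates the discounted reward on finite prefixes of reward sequences and takes the minimum over a finite collection of runs. It is only an approximation under stated assumptions: the function names are ours, the reward sequences are made up, and truncating a run after $n$ rewards ignores a tail of at most $r_{\max}\gamma^n/(1-\gamma)$ when all instantaneous rewards lie in $[0, r_{\max}]$.

```python
def discounted_reward(rewards, gamma):
    """Finite-prefix version of (1): sum over k of gamma**k * R_{k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def worst_case_reward(reward_sequences, gamma):
    """Analogue of (2) on a finite set of runs: the minimum discounted reward."""
    return min(discounted_reward(seq, gamma) for seq in reward_sequences)


def tail_bound(r_max, gamma, n):
    """Upper bound on the discounted reward ignored after the first n rewards."""
    return r_max * gamma ** n / (1 - gamma)


# Two hypothetical runs induced by the same system strategy, one per
# environment behavior; the reward values are illustrative only.
runs = [[10, 10, 0], [0, 10, 10]]
print(discounted_reward(runs[0], 0.9))
print(worst_case_reward(runs, 0.9))
print(tail_bound(10, 0.9, 3))
```

The expectation-based alternative mentioned above would replace `min` by a weighted average under the assumed distribution over environment strategies.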

III. Problem formulation and overview of the solution approach

We have modeled the interaction between the uncontrolled 
environment and the controlled system as a two-player 
game whose winning condition is described by a given LTL 
formula. Moreover, we defined reward functions to evaluate 
the performance of different system strategies. Now we can 
go on to formulate the main problem of the paper. 

Problem 1. A two-player deterministic game $\mathcal{G} = (S, S_s, S_e, I, A_c, A_{uc}, T, W)$ is given where $W = (L, \varphi)$ and $\varphi$ is realizable for $\mathcal{G}$. Find a memoryless or finite-memory winning strategy $\mu$ for the system such that $\bar{J}^{\mathcal{G}}_{\mathcal{R}}(\mu, s)$ is maximized for all $s \in I$, where a reward function $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ is given with respect to an unknown instantaneous reward function $\mathcal{R} : S_s \times (A_c \cup A_{uc}) \to \mathbb{R}^+ \cup \{0\}$.

Generally, there does not necessarily exist a memoryless or finite-memory winning strategy that maximizes the reward over all winning strategies, as it is possible that the instantaneous reward promotes the system to violate the specification. Take as an example $\mathcal{G}_0 = (S, S_s, S_e, I, A_c, A_{uc}, T, W)$, where $S_e = \emptyset$, $S = S_s = I = \{s_0, s_1\}$, $A_c = \{a_0, a_1, a_2\}$, $A_{uc} = \emptyset$ and $W = (L, \varphi)$. The transition function $T$ and the labeling function
L are shown in Fig. [I] and the formula is ip = f/h-i- 
The game $\mathcal{G}_0$ does have winning strategies for the system. For example, the strategy $\mu$ where $\mu(s_0) = a_1$, $\mu(s_1) = a_2$ is a memoryless winning strategy, and the strategy $\mu' = (\mu'_m, \rho'_m, \{0, 1\})$ where $\mu'_m(s_1, 0) = \mu'_m(s_1, 1) = \{a_2\}$, $\mu'_m(s_0, 0) = \{a_0, a_1\}$, $\mu'_m(s_0, 1) = \{a_1\}$ and $\rho'_m(s_1, 0) = 0$, $\rho'_m(s_1, 1) = \rho'_m(s_0, 0) = \rho'_m(s_0, 1) = 1$, is a finite-memory winning strategy.

But $\mathcal{G}_0$ does not have memoryless or finite-memory optimal winning strategies for the system. Suppose the unknown instantaneous reward function is actually defined as $\mathcal{R}(s_1, a_2) = 0$, $\mathcal{R}(s_0, a_0) = \mathcal{R}(s_0, a_1) = 10$, and the reward function is the same as (2). In order to maximize $\bar{J}^{\mathcal{G}_0}_{\mathcal{R}}(\mu, \cdot)$, $\mu$ should allow the system to stay at $s_0$ as many times as possible, but staying at $s_0$ forever will violate $\varphi$. Thus optimal winning strategies need infinite memory.
Let us now move on to an overview of the two-stage solution approach we propose. Given a game $\mathcal{G}$ as in Problem 1, we first extract a non-deterministic winning strategy $\mu_p$ called a permissive strategy [2], which guarantees that $R^\mu(s) \subseteq R^{\mu_p}(s)$ for all memoryless winning strategies $\mu$ and $s \in I$. In some special cases (e.g. the conditions in Proposition 2), we are even able to compute a maximally permissive strategy $\mu_p^{\max}$, such that $R^\mu(s) \subseteq R^{\mu_p^{\max}}(s)$ for all winning strategies $\mu$ and $s \in I$. Then in the second stage we restrict to the transitions allowed by $\mu_p$ (or $\mu_p^{\max}$), apply reinforcement learning methods to explore the a priori unknown instantaneous reward function $\mathcal{R}$ and compute an optimal strategy over all strategies of the new game obtained in the first stage. With this decomposition we manage to separate the problem of guaranteeing the correctness of specifications from that of seeking the optimality of the reward function with a priori unknown instantaneous rewards.

IV. Permissive strategies, learning and the main algorithm

This section is composed of three parts. We first introduce 
the idea of permissive strategies, then describe a reinforce¬ 
ment learning method which is used to learn an optimal 
strategy with respect to an unknown reward function without 
concern about any specification, and finally combine the 
two parts to apply the reinforcement learning method to 
explore for an optimal strategy out of those encoded in an 
appropriately constructed permissive strategy. 

Fig. 1: A game $\mathcal{G}_0$ without finite-memory optimal strategy.

A. Extraction of permissive strategies

We first introduce an inclusion relation between strategies. Recall that we have defined the set of runs induced by a strategy $\mu$ for the system with initial state $s \in I$ as $R^\mu(s)$. For two non-deterministic strategies $\mu_1$ and $\mu_2$ for the system, we say that $\mu_1$ includes $\mu_2$ if $R^{\mu_2}(s) \subseteq R^{\mu_1}(s)$ holds for all $s \in I$. Furthermore, if $\mu_1$ includes $\mu_2$ and $\mu_2$ includes $\mu_1$, we call $\mu_1$ and $\mu_2$ equivalent. In other words, equivalent strategies induce the same set of runs. A game $\mathcal{G}$ has a unique winning strategy if all its winning strategies are equivalent. Now we can define permissive strategies based on this strategy inclusion relation.

Definition 2. Given a two-player game $\mathcal{G}$, a non-deterministic strategy $\mu$ for the system is called permissive if (i) it is winning for the system and (ii) it includes all memoryless winning strategies for the system. A permissive strategy is called maximally permissive if it includes all winning strategies for the system.

All two-player games have permissive strategies. For 
games with finite states, there are only finitely many memo¬ 
ryless winning strategies. We can build a permissive strategy 
by adding a unique tag to each of them (as memory) and 
directly combining them together. In cases where there is 
no memoryless winning strategy, this fact is trivial as any 
winning strategy is permissive. 

In general, permissive strategies are not necessarily unique. For example, the game $\mathcal{G}_0$ in Fig. 1 has a unique memoryless winning strategy $\mu$ for the system where $\mu(s_0) = a_1$, $\mu(s_1) = a_2$. As a result, $\mu_p$ such that $\mu_p(s_0) = \{a_1\}$ and $\mu_p(s_1) = \{a_2\}$ is a deterministic memoryless permissive strategy. In the meantime, the finite-memory strategy $\mu' = (\mu'_m, \rho'_m, \{0, 1\})$ where $\mu'_m(s_1, 0) = \mu'_m(s_1, 1) = \{a_2\}$, $\mu'_m(s_0, 0) = \{a_0, a_1\}$, $\mu'_m(s_0, 1) = \{a_1\}$ and $\rho'_m(s_0, 0) = \rho'_m(s_0, 1) = \rho'_m(s_1, 1) = 1$, $\rho'_m(s_1, 0) = 0$ includes $\mu$ and thus is also a permissive strategy. As $\mu_p$ does not include $\mu'$, they are different permissive strategies of $\mathcal{G}_0$.

On the other hand, maximally permissive strategies must be unique by definition, if they exist for a game $\mathcal{G}$. The specific representations of maximally permissive strategies may not be unique, just like a memoryless strategy can be rewritten as a finite-memory strategy in which the allowed actions are independent of the memory.

It is naturally desirable to extract maximally permissive strategies as they include all the other winning strategies. While they do not exist in general, the following proposition gives a sufficient condition for their existence.

Proposition 1 ([2]). All games $\mathcal{G}$ with winning conditions $W = (L, \varphi)$ in which $\varphi$ is a safety formula have memoryless or finite-memory maximally permissive strategies.

This characterization can be extended to be both sufficient and necessary. It has been shown that precisely for the reactive safety properties [5], maximally permissive strategies exist. These are the properties equivalent to safety properties when the interaction between the environment and system is explicitly considered, i.e., reactive safety characterizes precisely the properties whose satisfaction can be checked by testing if the runs of $\mathcal{G}$ satisfy some safety formula.

There exists work on the construction of permissive strategies for games $\mathcal{G}$ with a general LTL formula $\varphi$ in the winning condition [2], [20], so we only sketch the relevant results briefly here. The first step is to compute a deterministic parity automaton [24] from $\varphi$, which is taken into account by constructing a new game $\mathcal{G}'$. $\mathcal{G}'$ uses a parity winning condition
for tp', which is of the form /\ 0 <c<n (nOA' 2 . l A 0DA' 2 .i + i) 
for some state formulas K\ , • ■ • , A' 2n +t that mutually imply 
each other, i.e., for which for all 0 < i < 2 n, K l+ \ —> A', 
is a valid LTL formula [20]. Games with such a winning 
condition have permissive strategies and Bernet et al. [2] 
provided a construction method to compute such strategies. 

Additionally, Ehlers and Finkbeiner [5] offer a method for checking if the winning condition $W$ is a reactive safety property for $\mathcal{G}$. The game $\mathcal{G}$ is first translated into a parity automaton $\mathcal{A}$ as before, which is then used to construct a parity tree automaton $\mathcal{A}'$ [24]. Tree automata are commonly used to explicitly model inputs and outputs and the overall behavior of reactive systems. For reactive safety properties, trees get rejected if and only if some path in the tree visits some violating states, i.e., the states from which all trees are rejected. In other words, the set of accepted trees should be exactly those that never visit any violating state in all paths. The acceptance of trees can be decided by simply checking the set of states they can visit. The problem of checking if $\varphi$ is a reactive safety property for $\mathcal{A}$ is reduced to checking the equivalence of two parity tree automata, which can be solved with existing approaches [9]. If we get a positive answer, we can construct another game with a safety formula in its winning condition which accepts exactly the same set of runs as $\mathcal{A}$. By Proposition 1 there exists a maximally permissive strategy for $\mathcal{G}$. The worst-case complexity of the resulting method is 2-EXPTIME.

Although Proposition 1 guarantees the existence of a maximally permissive strategy $\mu_p^{\max}$ when $\varphi$ is a safety formula, the computational time complexity is the same as that of synthesizing a strategy for a game with a general LTL formula, which is 2-EXPTIME [10]. It can be significantly improved when $\varphi$ is of the following special form. The proof is straightforward and is omitted due to the limited space.

Proposition 2. For all games $\mathcal{G}$ with winning condition $W = (L, \varphi)$ and $\varphi = \varphi_0 \wedge \square\varphi_1$, where $\varphi_0$ and $\varphi_1$ are Boolean formulas over $p$ and $\bigcirc q$ for $p, q \in AP$, a memoryless maximally permissive strategy can be computed in time linear in the number of transitions of $\mathcal{G}$ and the size of $\varphi$.
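
To illustrate the idea behind Proposition 2, the sketch below computes a memoryless maximally permissive strategy for a simplified safety game in which the specification is given as a set of safe states. This is an assumption of ours: the proposition also allows constraints over $\bigcirc q$, which this sketch does not handle. The function name and the toy game are hypothetical, and the naive fixpoint loop shown here is quadratic in the worst case, whereas a worklist over predecessor edges would be needed to obtain the linear bound.

```python
def max_permissive_safety(sys_states, trans, safe):
    """Greatest-fixpoint solution of a state-based safety game.

    trans maps (state, action) to the unique successor; safe is the set of
    states allowed by the specification. Returns the winning region and, for
    every winning system state, the set of actions that stay inside it.
    """
    moves = {}
    for (s, a), t in trans.items():
        moves.setdefault(s, {})[a] = t

    win = set(safe)
    changed = True
    while changed:
        changed = False
        for s in list(win):
            succs = list(moves.get(s, {}).values())
            if s in sys_states:
                # The system needs at least one action that stays winning.
                ok = any(t in win for t in succs)
            else:
                # The environment must be unable to leave the winning region.
                ok = bool(succs) and all(t in win for t in succs)
            if not ok:
                win.discard(s)
                changed = True

    strategy = {
        s: {a for a, t in moves[s].items() if t in win}
        for s in win if s in sys_states
    }
    return win, strategy


# Hypothetical game: from e1 the environment can force the unsafe state "bad".
trans = {
    ("s0", "a"): "e0", ("s0", "b"): "e1",
    ("s1", "a"): "e0",
    ("e0", "x"): "s0", ("e0", "y"): "s1",
    ("e1", "x"): "s0", ("e1", "y"): "bad",
    ("bad", "a"): "bad",
}
win, strategy = max_permissive_safety({"s0", "s1", "bad"}, trans,
                                      {"s0", "s1", "e0", "e1"})
print(sorted(win), strategy)
```

Every action that keeps the system inside the winning region is retained, which is what makes the resulting strategy maximally permissive for this simplified class of games.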

We use the software tool slugs [6] to extract permissive strategies when $\varphi$ in $\mathcal{G}$ is in the form of generalized reactivity (1) (GR(1)) [15]. Under the condition of Proposition 2, slugs synthesizes a maximally permissive strategy.

The extraction and application of permissive strategies greatly simplify the solution of Problem 1, enabling us to focus on optimizing the performance through strategies known to be correct. By Definition 2 a permissive strategy $\mu_p$ is non-deterministic and thus its application to a game $\mathcal{G}$ is essentially encoding its memory update function into the game structure and removing all transitions that it does not allow. Hence any run $\pi'$ of the resulting game $\mathcal{G}'$ has a unique counterpart $\pi$ in the runs of $\mathcal{G}$ induced by $\mu_p$, and vice versa. Moreover, such $\pi$ and $\pi'$ can only be winning for the system simultaneously. Since $\mu_p$ is winning for the system in $\mathcal{G}$, all runs it induces are winning for the system and so are their counterpart runs in $\mathcal{G}'$. As a result, any strategy $\mu'$ of $\mathcal{G}'$ is winning for the system. Let $J^{\mathcal{G}'}_{\mathcal{R}}(\pi')$ be the same as $J^{\mathcal{G}}_{\mathcal{R}}(\pi)$, and let $\bar{J}^{\mathcal{G}'}_{\mathcal{R}}$ be defined similarly to $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$.
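
The construction described above, encoding the memory into the game structure and removing disallowed transitions, can be sketched as a product of game states and memory values. The code below is an illustrative Python sketch under our own assumptions: the function name `apply_permissive`, the dictionary encodings of $\mu_m$ and $\rho_m$, and the toy game are hypothetical. The toy strategy mirrors the shape of $\mu'$ above, but its transitions are invented for the example and are not taken from Fig. 1.

```python
def apply_permissive(sys_states, trans, mu, rho, m0, init):
    """Restrict a deterministic game to a finite-memory permissive strategy.

    trans: (state, action) -> successor
    mu:    (system state, memory) -> set of allowed actions
    rho:   (state, memory) -> updated memory, evaluated after every move
    Returns the transitions of the restricted game over (state, memory) pairs
    reachable from the initial states with initial memory m0.
    """
    restricted = {}
    frontier = [(s, m0) for s in sorted(init)]
    seen = set(frontier)
    while frontier:
        s, m = frontier.pop()
        for (q, a), t in trans.items():
            if q != s:
                continue
            if s in sys_states and a not in mu[(s, m)]:
                continue  # transition not allowed by the permissive strategy
            nxt = (t, rho[(t, m)])  # memory update m' = rho(s', m)
            restricted[((s, m), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return restricted


# Hypothetical single-player game: the system may stay at p once, then must go.
trans = {("p", "stay"): "p", ("p", "go"): "q", ("q", "back"): "p"}
mu = {("p", 0): {"stay", "go"}, ("p", 1): {"go"},
      ("q", 0): {"back"}, ("q", 1): {"back"}}
rho = {("p", 0): 1, ("p", 1): 1, ("q", 0): 0, ("q", 1): 1}
restricted = apply_permissive({"p", "q"}, trans, mu, rho, 0, {"p"})
for key in sorted(restricted):
    print(key, "->", restricted[key])
```

Environment states keep all of their outgoing transitions, since a system strategy cannot restrict the environment.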

B. Reinforcement learning 

Now that we have acquired a game $\mathcal{G}'$ whose runs are all guaranteed to be correct with respect to the underlying linear temporal logic specification, we can move on to learn an optimal strategy with respect to an a priori unknown instantaneous reward function $\mathcal{R}$. The reinforcement learning algorithms aim to maximize $\bar{J}^{\mathcal{G}'}_{\mathcal{R}}$ for the game $\mathcal{G}'$.

The choice of reinforcement learning algorithms depends on the choice of the reward function $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ in Problem 1, regardless of how the permissive strategy $\mu_p$ is generated. Here we focus on discounted reward functions, but the pseudo-algorithm in Section IV-C also works with other forms of $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ so long as there exists an optimal deterministic winning strategy $\mu'$ which can be solved by the corresponding reinforcement learning method.

The discounted reward function for evaluating the rewards obtained by a run is shown in (1). We particularly focus on the minimal (worst-case) possible reward for each system strategy, as shown in (2). This definition concerns the tight lower bound of the reward obtained by executing strategy $\mu'$ whatever strategy the uncontrolled environment implements. In other words, we assume that the environment acts adversarially and the game is equivalently a zero-sum game. It has been shown that in this case both the environment and the system have deterministic memoryless optimal strategies in $\mathcal{G}'$ [19]. As a result we can neglect all randomized strategies without loss of optimality. Such an optimal strategy can be computed by the maximin-Q algorithm, which is a simple variation of the minimax-Q learning algorithm [11]. It can also be solved by the generalized Q-learning algorithm for alternating Markov games [12]. Both methods guarantee that the learned greedy strategy, which always chooses an action with the best learned Q value, converges to an optimal strategy for a system interacting with an adversary under some common convergence conditions.
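
The following Python sketch shows one way a maximin-style Q update could look for a turn-based deterministic game: system states back up the maximum over their Q values and environment states the minimum. It is not the algorithm of [11] or [12] verbatim. The function name, the uniform random exploration, the constant learning rate, the choice to apply the discount only on system moves, and the toy game are all assumptions made for illustration.

```python
import random


def maximin_q(sys_states, trans, reward, gamma=0.9, alpha=0.5,
              episodes=3000, horizon=20, seed=0):
    """Tabular maximin-style Q-learning on a turn-based deterministic game.

    trans:  (state, action) -> successor; every state must have an action
    reward: callable (state, action) -> instantaneous reward, queried only
            for system moves, playing the role of the unknown reward function
    """
    rng = random.Random(seed)
    actions = {}
    for (s, a) in trans:
        actions.setdefault(s, []).append(a)
    states = sorted(actions)
    q = {sa: 0.0 for sa in trans}

    def value(s):
        vals = [q[(s, a)] for a in actions[s]]
        # The system maximizes, the adversarial environment minimizes.
        return max(vals) if s in sys_states else min(vals)

    for _ in range(episodes):
        s = rng.choice(states)  # random restarts so that every pair is visited
        for _ in range(horizon):
            a = rng.choice(actions[s])
            nxt = trans[(s, a)]
            if s in sys_states:
                target = reward(s, a) + gamma * value(nxt)
            else:
                target = value(nxt)  # assumption: no reward or discount here
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = nxt

    greedy = {s: max(actions[s], key=lambda a: q[(s, a)])
              for s in states if s in sys_states}
    return q, greedy


# Hypothetical game: "right" pays more once, but lets the environment trap
# the system in state t, where no further reward can be collected.
trans = {
    ("s", "left"): "eL", ("s", "right"): "eR",
    ("eL", "x"): "s",
    ("eR", "x"): "s", ("eR", "y"): "t",
    ("t", "wait"): "eT", ("eT", "x"): "t",
}
rewards = {("s", "left"): 1.0, ("s", "right"): 2.0}
q, greedy = maximin_q({"s", "t"}, trans, lambda s, a: rewards.get((s, a), 0.0))
print(greedy["s"], round(q[("s", "left")], 2))
```

In this toy game the worst-case value of always choosing "left" is $1/(1 - 0.9) = 10$, while "right" yields only 2 against an adversarial environment, so the greedy strategy prefers "left" even though its immediate reward is smaller.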

C. Connecting the dots: correct-by-synthesis learning

Having discussed permissive strategies and reinforcement learning, we are now ready to connect the pieces and discuss a solution to Problem 1, which is outlined in Algorithm 1. Maximally permissive strategies play a special role as they include all winning strategies for the system, and their existence naturally divides the solution into two cases.

Algorithm 1 Pseudo-algorithm for solving Problem 1

Input: A game $\mathcal{G} = (S, S_s, S_e, I, A_c, A_{uc}, T, W)$ with $W = (L, \varphi)$ in which $\varphi$ is a realizable formula for $\mathcal{G}$, reward functions $J^{\mathcal{G}}_{\mathcal{R}}$ and $\bar{J}^{\mathcal{G}}_{\mathcal{R}}$ (e.g. as in (1), (2)) with respect to an unknown instantaneous reward function $\mathcal{R}$.
Output: A winning strategy $\mu$ for the system that maximizes $\bar{J}^{\mathcal{G}}_{\mathcal{R}}(\mu, s)$ for all $s \in I$.

Step 1. Compute a (maximally) permissive strategy $\mu_p$.
Step 2. Apply $\mu_p$ to $\mathcal{G}$ and modify $\mathcal{G}$ into a new game $\mathcal{G}' = (S', S'_s, S'_e, I', A_c, A_{uc}, T', W')$, where $W' = (L', \mathrm{True})$.
Step 3. Compute $\mu'^*$ that maximizes $\bar{J}^{\mathcal{G}'}_{\mathcal{R}}(\mu', s)$ for all $s \in I'$ with some reinforcement learning algorithm (e.g. the maximin-Q algorithm).
Step 4. Map $\mu'^*$ in $\mathcal{G}'$ back to $\mu^*$ in $\mathcal{G}$.
return $\mu^*$.

1) For games whose maximally permissive strategies can be computed: If maximally permissive strategies can be computed for a game $\mathcal{G}$, $\mu_p$ in Algorithm 1 includes all winning strategies and is a winning strategy itself. Applying it to $\mathcal{G}$ not only guarantees winning for the system but also preserves all winning strategies for the system in all subsequent steps, which decouples the correctness requirements from optimality concerns. As the output of the reinforcement learning algorithm used in Step 3 is guaranteed to converge to an optimal deterministic winning strategy, the output of Algorithm 1 is guaranteed to be a solution of Problem 1. Theorem 1 summarizes this result in a special case.

Theorem 1. If the conditions in Proposition 2 hold, the output of Algorithm 1 is a solution to Problem 1.

2) For games whose maximally permissive strategies cannot be computed: If maximally permissive strategies for a game $\mathcal{G}$ are not solvable, the best we can expect is to extract a permissive strategy which includes a proper subset of winning strategies for the system. There can be many permissive strategies for the same game with different “permissiveness”, i.e., including different subsets of winning strategies. For two different permissive strategies $\mu_1$ and $\mu_2$ for the system, if $\mu_2$ includes $\mu_1$, intuitively $\mu_2$ would be more “permissive” and have a worst-case reward at least as high, although it is also expected to consume more computation resources. Thus there is a natural trade-off between “permissiveness” and optimality for the solution of this case, which is illustrated in Section V.


V. Examples 

We demonstrate the use of Algorithm 1 on robot motion planning examples in grid worlds with different sizes and winning specifications. The game in the first example has a maximally permissive strategy for the system as its specification is a safety formula, while for the second example we can at most compute a permissive strategy. The last example shows the trade-off between the performance of the learned system strategy of Algorithm 1 and the computation cost.

Example 1: Two robots, namely a system robot and an 
environment robot, move in an N-by-N square grid world 
strictly in turns. It is known that the two robots are in 
different cells initially and at each move, the environment 
robot must go to an adjacent cell, while the system robot 
can either go to an adjacent cell or stay in its current cell. 
The system robot should always avoid collision with the 
environment robot. Assume that the positions of both robots 
are always observable for the system. 

This problem can be formulated as a game
$\mathcal{G} = \{S, S_s, S_e, I, A_c, A_{uc}, T, W\}$ with $W = (L, \varphi_0)$.
Let $Pos = \{0, \ldots, N^2 - 1\}$ be the set of cells in the map. Then
$S = Pos \times Pos \times \{0, 1\}$, $S_s = Pos \times Pos \times \{1\}$,
$S_e = Pos \times Pos \times \{0\}$, and
$I = \{(x, y, 1) \mid x, y \in Pos, x \neq y\}$. The action sets are
$A_c = \{up_s, down_s, left_s, right_s, stay_s\}$ and
$A_{uc} = \{up_e, down_e, left_e, right_e\}$. The transition function
$T$ guarantees that $A_c$ and $A_{uc}$ only change the first and
second component of a state, respectively. The set of atomic
propositions is

$AP = (\bigcup_{i=0}^{N^2-1} x_i) \cup (\bigcup_{j=0}^{N^2-1} y_j) \cup \{t_0, t_1\}$.

The labeling function is $L(s) = \{x_i, y_j, t_k\}$ if $s = (i, j, k) \in S$.
The requirements on the system robot can be expressed as

$\varphi_0 = \bigwedge_{i=0}^{N^2-1} (\neg x_i \vee \neg y_i) \wedge \Box \bigwedge_{i=0}^{N^2-1} (x_i \to \neg y_i)$.

Proposition 2 asserts that we can compute a maximally permissive
strategy and construct the game obtained by applying it to
$\mathcal{G}$. By Theorem 1, Algorithm 1 is
expected to output an optimal strategy for the system.
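
The state space of this game can be enumerated with a short sketch. The cell numbering is an assumption: cells are numbered row by row starting from the lower left corner, which is consistent with the corner cells named in Example 2 (cell 0 lower left, cell N - 1 lower right, cell N^2 - N upper left). All function names are hypothetical.

```python
from itertools import product

def build_states(n):
    """Enumerate S, S_s, S_e and I for an n-by-n grid."""
    pos = range(n * n)                            # Pos = {0, ..., N^2 - 1}
    states = set(product(pos, pos, (0, 1)))       # S = Pos x Pos x {0, 1}
    sys_states = {s for s in states if s[2] == 1} # system's turn
    env_states = {s for s in states if s[2] == 0} # environment's turn
    init = {(x, y, 1) for x in pos for y in pos if x != y}
    return states, sys_states, env_states, init

def neighbors(cell, n):
    """Adjacent cells (up, down, left, right), assuming row-major numbering."""
    row, col = divmod(cell, n)
    result = []
    if row + 1 < n:
        result.append(cell + n)   # up
    if row > 0:
        result.append(cell - n)   # down
    if col > 0:
        result.append(cell - 1)   # left
    if col + 1 < n:
        result.append(cell + 1)   # right
    return result

def env_moves(cell, n):
    """The environment robot must go to an adjacent cell."""
    return neighbors(cell, n)

def sys_moves(cell, n):
    """The system robot may go to an adjacent cell or stay."""
    return neighbors(cell, n) + [cell]

def collision_free(state):
    """Safety part of the specification: the two robots never share a cell."""
    return state[0] != state[1]
```

For N = 3 this gives 162 states in total and 72 initial states, since there are 9 x 8 ordered pairs of distinct cells.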

The reward functions are as defined earlier, with the
discounting factor $\gamma$ set to 0.9. However, the
instantaneous reward function $\mathcal{R}$ is a priori unknown to the
system robot. In practical scenarios $\mathcal{R}$ is often given by some
independent human operator or trainer of the system robot for
unpredictable purposes with arbitrarily complicated structure,
and thus can neither be acquired nor guessed ahead of
time. For this numerical example, $\mathcal{R}$ is set to encourage the
system robot to reach positions diagonal to the environment
robot’s position as often as possible. From a state $s \in S_s$,
$\mathcal{R}(s, a) = 1$ if the two robots are diagonal to each other
at $T(s, a)$; otherwise $\mathcal{R}(s, a) = 0$. This information is
not available to the system robot in advance and is only
revealed through the learning process: the system robot
receives an instantaneous reward only when it takes the
corresponding transition.
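
A minimal sketch of such a reward function follows. It rests on two assumptions: "diagonal" is read as diagonally adjacent, and cells are numbered row by row (consistent with the corner cells named in Example 2). The names `is_diagonal` and `reward` are hypothetical. In the learning setting the system would only observe the returned value after taking the transition, never the function itself.

```python
def is_diagonal(sys_cell, env_cell, n):
    """True if the two cells are diagonally adjacent on an n-by-n grid."""
    r1, c1 = divmod(sys_cell, n)
    r2, c2 = divmod(env_cell, n)
    return abs(r1 - r2) == 1 and abs(c1 - c2) == 1

def reward(next_state, n):
    """Instantaneous reward evaluated at the successor state T(s, a)."""
    sys_cell, env_cell, _turn = next_state
    return 1 if is_diagonal(sys_cell, env_cell, n) else 0
```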

The results for the cases when N = 3, 4, 5, 6, 8, 10 are
shown in Table I, where $t_e$ is the time [s] spent extracting
a maximally permissive strategy $\mu^{\max}$ with slugs, and $t_l$
is the time [s] used to learn an optimal strategy $\mu^*$. The
numbers of states and state-action tuples are for the game
constructed in Algorithm 1. All examples were run on a laptop
with a 2.4 GHz CPU and 8 GB memory.

TABLE I: Results for Example 1.

N     t_e [s]    t_l [s]    Iterations       |S|      |S_s|
3     0.10       4.28       9 x 10^4         120      72
4     0.21       16.35      3.2 x 10^5       432      240
5     2.20       43.12      8.5 x 10^5       1120     600
6     19.40      88.69      1.81 x 10^6      2400     1260
8     30.29      305.77     6.05 x 10^6      7840     4032
10    300.00     771.73     1.562 x 10^7     19440    9900

Now we illustrate the optimality of the learned greedy
strategy with the simulation result for N = 4, which
is shown in Fig. 2. Let $\mu$ be the greedy strategy of the
system learned by the maximin-Q learning algorithm against
an adversarial environment. If from a state $s \in S_s$ the system
robot needs at least $k \in \mathbb{N}$ steps to reach a position
diagonal to the environment robot,
$J^{\mathcal{G}}_{\mathcal{R}}(\mu', s)$ is upper bounded by
$\sum_{t=k}^{\infty} \gamma^t \cdot 1 = \frac{\gamma^k}{1-\gamma}$
for any system strategy $\mu'$. By definition, if $\mu^*$ is an
optimal strategy for the system against an adversarial
environment, we have
$J^{\mathcal{G}}_{\mathcal{R}}(\mu', s) \le J^{\mathcal{G}}_{\mathcal{R}}(\mu^*, s) \le \frac{\gamma^k}{1-\gamma}$.
In this 4-by-4 case, the system can always reach a diagonal
position in 3 steps. Fig. 2 shows that $V$ converges to the
values 10, 9, 8.1 and 7.29, which coincide with
$\frac{\gamma^k}{1-\gamma}$ when k = 0, 1, 2, 3 and $\gamma = 0.9$.
Thus by the inequality above, $J^{\mathcal{G}}_{\mathcal{R}}(\mu, s)$ also
coincides with $J^{\mathcal{G}}_{\mathcal{R}}(\mu^*, s)$, indicating that
$\mu$ itself is an optimal strategy of
the system, as predicted by Algorithm 1.
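
The four values quoted above follow directly from the geometric-series bound. A short check, where the helper name `discounted_bound` is hypothetical:

```python
def discounted_bound(gamma, k):
    """Sum of gamma^t for t >= k, i.e. gamma^k / (1 - gamma)."""
    return gamma ** k / (1.0 - gamma)

# Rounded to two decimals to hide floating-point noise.
bounds = [round(discounted_bound(0.9, k), 2) for k in range(4)]
print(bounds)  # [10.0, 9.0, 8.1, 7.29]
```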



Fig. 2: Results for Example 1 for N = 4: (left)
$J^{\mathcal{G}}_{\mathcal{R}}(\mu, s)$
for all $s \in S_s$ and the learned greedy strategy $\mu$; (right) the
logarithm of the maximal change in $V$ in every $10^4$ iterations.

Example 2: Now we construct a new game $\mathcal{G}_1$ with a
new winning condition $W_1 = (L, \varphi_1)$ from $\mathcal{G}$ by adding
liveness assumptions to the environment robot and liveness
requirements to the system robot. To be more specific, we
require the system robot to visit the upper left corner (cell
$N^2 - N$) and the lower right corner (cell $N - 1$) infinitely
often, provided that the environment robot visits the lower
left corner (cell 0) and the upper right corner (cell $N^2 - 1$)
infinitely often. $\mathcal{G}_1$ is the same as $\mathcal{G}$ except that

$\varphi_1 = \varphi_0 \wedge ((\Box\Diamond x_0 \wedge \Box\Diamond x_{N^2-1}) \to (\Box\Diamond y_{N-1} \wedge \Box\Diamond y_{N^2-N}))$.

The definition of the instantaneous reward function $\mathcal{R}$ remains the
same as in Example 1, and the learning result of
$J^{\mathcal{G}_1}_{\mathcal{R}}(\mu, s)$
when N = 4 is given in Fig. 3. With this specification
the system has no maximally permissive strategies, and it is
expected that the true value of $J^{\mathcal{G}_1}_{\mathcal{R}}(\mu, s)$
should be almost the same as $J^{\mathcal{G}}_{\mathcal{R}}(\mu^*, s)$,
as the system robot is allowed to follow
$\mu^*$ for as many finite moves as desired. However, Fig. 3
shows that $J^{\mathcal{G}_1}_{\mathcal{R}}(\mu, s)$ is smaller than
$J^{\mathcal{G}}_{\mathcal{R}}(\mu^*, s)$, indicating a
sub-optimality due to the loss of some winning strategies by
the permissive strategy.



Fig. 3: Result for Example 2 when N = 4:
$J^{\mathcal{G}_1}_{\mathcal{R}}(\mu, s)$ for all
$s \in S_s$ and a learned greedy strategy $\mu$.

Example 3: We now illustrate the trade-off between the
performance of the learned strategy and the computation
cost in Algorithm 1. Consider a new game $\mathcal{G}_2$ with winning
condition $W_2 = (L, \varphi_2)$, which differs slightly from
the game $\mathcal{G}$ of Example 1 in that it also requires the system
robot to visit one of two given cells (say cell $N^2 - N$
and cell $N - 1$) infinitely often. In other words,
$\varphi_2 = \varphi_0 \wedge \Box\Diamond(y_{N^2-N} \vee y_{N-1})$,
which is in the form of GR(1).
We compute a memoryless permissive strategy $\mu_2$ for $\mathcal{G}_2$.

Now we design a sequence of games $\mathcal{G}_2^1$ to $\mathcal{G}_2^6$ from $\mathcal{G}$
in the following way. For each game we add a counter as
a new controlled state variable which counts the number of
the system’s moves since its last visit to cell $N^2 - N$ or cell
$N - 1$, but the maximum allowed counter value increases
monotonically from $\mathcal{G}_2^1$ to $\mathcal{G}_2^6$. The value of each counter
should always be less than its corresponding maximum value.
All these 6 games satisfy the condition in Proposition 2
and we can extract a maximally permissive strategy for
each of them. With the counters, the system robot is forced
to visit cell $N^2 - N$ or cell $N - 1$ infinitely often and
as a result, any permissive strategy of any game in this
sequence is also a permissive strategy for $\mathcal{G}_2$. Let $\mu_2^i$ be
the extracted maximally permissive strategy of the game $\mathcal{G}_2^i$
for $i = 1, \ldots, 6$. By definition of maximally permissive
strategies, $\mu_2^i$ includes $\mu_2^j$ if $i > j$, $i, j \in \{1, \ldots, 6\}$. In
this way we extract a sequence of permissive strategies
with increasing permissiveness for the game $\mathcal{G}_2$.
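
The counter construction can be sketched as a small update rule. This is only an illustration of the description above: the function `next_counter`, the use of `None` to mark a violated bound, and the choice to reset the counter to 0 on a visit are assumptions.

```python
def next_counter(counter, sys_cell, targets, max_counter):
    """Update the counter after a system move.

    The counter resets when the system robot enters a target cell and
    increases by one otherwise. None signals that the bound is violated,
    i.e. the move is not allowed in the bounded game.
    """
    new_counter = 0 if sys_cell in targets else counter + 1
    return new_counter if new_counter < max_counter else None

N = 3
TARGETS = {N * N - N, N - 1}   # cells 6 and 2 for the 3-by-3 case
```

Any move that keeps the counter valid under a bound of 4 also keeps it valid under a bound of 6, which mirrors why the strategies extracted for larger bounds include those extracted for smaller ones.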

We apply the same learning procedure as in the previous
two examples to $\mathcal{G}_2$ and the game sequence from $\mathcal{G}_2^1$ to
$\mathcal{G}_2^6$. For the 3-by-3 case, the maximum allowed counter
values and the maximum values of the learned discounted
reward are shown in Table II. The maximum
discounted reward, which can be seen as the performance
of the learned system strategy, increases monotonically with
the maximum counter value, i.e., the permissiveness of the
permissive strategy. In the meantime, the number of learning
iterations and the computation time grow. This illustrates the
trade-off between the performance of the learned strategy
and the computation cost.


TABLE II: Results for Example 3 (for the 3-by-3 case).

Strategy    Max counter    Max discounted    Learning    Learning iterations
            value          reward            time [s]    [x 10^4]
$\mu_2$     N/A            5.7368            9.87        20
$\mu_2^1$   4              8.0922            9.70        19
$\mu_2^2$   6              8.8658            16.00       33
$\mu_2^3$   8              9.2442            27.34       55
$\mu_2^4$   10             9.4647            41.54       83
$\mu_2^5$   14             9.7034            83.88       172
$\mu_2^6$   20             9.8616            275.88      534


VI. Concluding remarks 
We studied synthesis of optimal reactive controllers that
are correct with respect to given temporal logic
specifications. The performance criteria are unknown during design
but can be inferred at run time. We proposed a solution that
merges ideas from permissive strategy synthesis and
reinforcement learning. We provided sufficient conditions (on the
underlying temporal logic specifications) needed to acquire
optimal performance, and demonstrated the algorithm on a
number of robot motion planning examples.